Chromosome-level genome assembly of largemouth bass (Micropterus salmoides) using PacBio and Hi-C technologies

The largemouth bass (Micropterus salmoides) has become a cosmopolitan species owing to its widespread introduction as a game or domesticated fish. Here, a high-quality chromosome-level reference genome of M. salmoides was produced by combining Illumina paired-end sequencing, PacBio single-molecule real-time (SMRT) sequencing and high-throughput chromosome conformation capture (Hi-C) technologies. The final assembly spans 844.88 Mb, with a contig N50 of 15.68 Mb and a scaffold N50 of 35.77 Mb. About 99.9% of the assembled sequence (844.00 Mb) was anchored to 23 chromosomes, and 98.03% of the anchored sequence could be ordered and oriented. The genome contains 38.19% repeat sequences and 2,693 noncoding RNAs. A total of 26,370 protein-coding genes from 3,415 gene families were predicted, of which 97.69% were functionally annotated. This high-quality genome assembly will be a fundamental resource for studying how M. salmoides adapts to novel and changing environments around the world, and is expected to contribute to genetic breeding and other research.

www.nature.com/scientificdata

In the present study, a novel high-quality chromosome-level genome assembly of largemouth bass was generated by single-molecule real-time sequencing combined with Illumina paired-end sequencing and Hi-C (   was constructed using 350 bp library data. A total of 49,157,214,151 k-mers were used for genome size estimation after the removal of k-mers with abnormal depth. The peak 19-mer depth was 56, and the genome size was calculated as 49,157,214,151/56 = 874.14 Mb.

Sequencing libraries. Tissues from a two-year-old adult female largemouth bass (body weight 1487 g, length 36 cm), obtained from an aquaculture farm in Chongzhou, Sichuan Province, China, were used to construct genomic DNA sequencing libraries (muscle) and transcriptome sequencing libraries (liver, brain, muscle, heart, kidney, gill, and gonad). All tissues were stored in liquid nitrogen until use.
For short-read sequencing, genomic DNA was extracted from 500 mg of muscle using cetyl trimethylammonium bromide (CTAB) followed by chloroform purification. The genomic DNA was sonicated to a fragment size of 350 bp and the paired-end genomic library was prepared following the Illumina standard protocol, including terminal repair, polyA and adaptor addition, target fragment selection and PCR (Illumina, San Diego, CA, USA). The resulting library was quality checked using an Agilent 2100 Bioanalyzer and qPCR, and sequenced on an Illumina NovaSeq 6000 platform with a paired-end 150 bp read layout.
For long-read sequencing, genomic DNA (~8 µg) was sheared into large fragments with a g-TUBE (Covaris), purified and recovered with AMPure PB magnetic beads, and used to construct single-molecule real-time bell (SMRTbell) sequencing libraries with the SMRTbell Template Prep Kit 2.0 (PacBio) 28. The end-repaired fragments were size-selected using the BluePippin Size-Selection System (Sage Science, MA, USA) and damage-repaired using the SMRTbell Damage Repair Kit (PacBio). The products were then bound to polymerase using the PacBio DNA/Polymerase Kit before being sequenced on the PacBio Sequel platform.
Full-length transcriptome sequencing of a sample pool consisting of muscle, liver, gonad, kidney, gut, blood, and gills was used to generate RNA data for gene prediction. Total RNA was extracted with TRIzol reagent (Invitrogen, USA) according to the manufacturer's protocol. RNA purity was checked using a NanoPhotometer spectrophotometer (IMPLEN, CA, USA). RNA concentration was measured using the Qubit RNA Assay Kit on a Qubit 2.0 Fluorometer (Life Technologies, CA, USA). RNA from these tissues was then mixed in equal amounts to produce cDNA using the SMARTer PCR cDNA Synthesis Kit, and sequenced on one SMRT cell on the PacBio Sequel platform. Raw reads were processed into error-corrected reads of insert (ROIs) using the Iso-Seq pipeline with minFullPass = 0 and minPredictedAccuracy = 0.90. Next, full-length, non-chimeric (FLNC) transcripts were determined by searching for the polyA tail signal and the 5′ and 3′ cDNA primers in the ROIs. Full-length consensus sequences obtained from ICE (Iterative Clustering for Error Correction) were polished using Quiver. Finally, full-length transcriptome sequencing yielded 20 Gb of clean data, including 26,369 high-quality consensus isoform sequences with an average length of 2,895 bp.
Genome survey and assembly. The size, heterozygosity, and repeat content of the M. salmoides genome were estimated from the k-mer frequency distribution of the Illumina paired-end reads using the kmer_freq_stat script (Biomarker Technologies, Beijing, China), based on the formula G = (N_k-mer − N_error_k-mer)/D, where G is the genome size, N_k-mer is the number of k-mers, N_error_k-mer is the number of depth-1 k-mers, and D is the k-mer depth. After removing the k-mers with abnormal depth, a total of 49.16 billion k-mers were obtained, with a k-mer peak at a depth of 56 (Fig. 2). A total of 58.51 Gb of high-quality filtered data was generated from the Illumina short-read DNA library, corresponding to 66.94× genome coverage, with a Q20 of 96.63% and a Q30 of 91.36% (Table 1). The genome size was estimated at 874.14 Mb, with 0.12% heterozygosity, 30.03% repetitive sequences, and 40.88% GC content (Table 1).
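The k-mer formula and the coverage figure above can be sketched in a few lines of Python. This is a minimal illustration, not the kmer_freq_stat script itself; the function names are hypothetical, the error-k-mer term is set to 0 on the assumption that the quoted k-mer total is already filtered, and Gb and Mb are taken as 10^9 and 10^6 bases.

```python
def estimate_genome_size(n_kmers: int, n_error_kmers: int, peak_depth: float) -> float:
    """Return the estimated genome size in bp: G = (N_kmer - N_error_kmer) / D."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return (n_kmers - n_error_kmers) / peak_depth


def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Return the average sequencing depth (fold coverage)."""
    return total_bases / genome_size


# Figures quoted in the text. The k-mer total is described as already
# filtered, so the error term is set to 0 here (an assumption).
size_bp = estimate_genome_size(49_157_214_151, 0, 56)
print(f"Estimated genome size: {size_bp / 1e6:.2f} Mb")

# Short-read coverage relative to the reported 874.14 Mb estimate.
depth = sequencing_depth(58.51e9, 874.14e6)
print(f"Illumina coverage: {depth:.2f}x")
```

With the rounded inputs quoted above, the plain division gives roughly 877.8 Mb, close to but not identical to the reported 874.14 Mb; the exact depth value or additional correction applied by the script is not stated in the text. The coverage comes out at about 67×, in line with the reported figure.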
For long-read sequencing, reads longer than 500 bp generated by the PacBio Sequel platform were collected, and a de novo genome was initially assembled using SMARTdenovo 29 from the data corrected by Canu v. 1.5 30. Subsequently, three rounds of refinement of the de novo genome were performed using Pilon 31 with the Illumina short-read sequencing data. In total, the long-read SMRTbell library generated 94.69 Gb of data (112.07× genome coverage) with a read N50 of 35.34 kb and an average read length of 24.75 kb. After error correction and assembly, an 844.88 Mb genome was assembled from 265 contigs with an N50 of 15.68 Mb (Table 1).
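For readers unfamiliar with the N50 statistic used for reads and contigs above, the following is a minimal sketch of its standard definition: the length L such that sequences of length L or longer account for at least half of the total bases. The function name and the toy lengths are illustrative only and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of the
        # total is the N50.
        if running >= half_total:
            return length


# Toy contig lengths (arbitrary units), not real data.
toy_contigs = [10, 7, 5, 4, 3, 1]
print(n50(toy_contigs))
```

In practice the lengths would be read from the assembly FASTA index; a larger N50 for the same total size indicates a more contiguous assembly.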

Hi-C analysis and chromosome assembly. Hi-C libraries were prepared as previously reported 32,33. Briefly, muscle tissue cells were fixed with formaldehyde to preserve the 3D structure of the DNA, and the cross-linked DNA was digested with the restriction endonuclease HindIII. Biotin-labeled bases were then introduced during DNA end repair. DNA (4 µg) was fragmented with a Covaris S220 focused-ultrasonicator (Gene Company Limited, Hong Kong) and 300-700 bp fragments were recovered. The DNA fragments carrying interaction information were captured with streptavidin magnetic beads for library construction. Library concentration and insert size were determined using the Qubit 3.0 and LabChip GX platforms (PerkinElmer), respectively. qPCR was used to estimate the effective concentration of the library. High-quality Hi-C libraries were sequenced on the Illumina NovaSeq 6000 platform, and the sequencing data were used for chromosome-level assembly 34. Burrows-Wheeler Aligner (BWA-MEM v. 0.7.10-r789) was used to align the clean paired-end reads to the assembled genome to obtain the uniquely mapped read pairs 35. The uniquely mapped read pairs were processed using HiC-Pro 36. The genome contigs, split into 50 kb segments and combined with the uniquely mapped Hi-C data, were clustered, ordered and oriented onto the pseudochromosomes using LACHESIS 34 with the following parameters: CLUSTER_MIN_RE_SITES = 30; CLUSTER_MAX_LINK_DENSITY = 2; CLUSTER_NONINFORMATIVE_RATIO = 2; ORDER_MIN_N_RES_IN_TRUNK = 68; ORDER_MIN_N_RES_IN_SHREDS = 67. Finally, the chromosome assemblies were cut into bins of equal length (100 kb), and the interaction signals generated by the valid mapped read pairs between each pair of bins were visualized in a heat map.
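The binning step behind the heat map can be illustrated with a short sketch. This is not the HiC-Pro or LACHESIS implementation; it only shows, under simplifying assumptions (a single chromosome, 0-based coordinates, hypothetical function names and toy read pairs), how valid read pairs are counted into a symmetric matrix of fixed-size bins.

```python
from collections import Counter

BIN_SIZE = 100_000  # 100 kb bins, as used for the heat map


def bin_contacts(pairs, bin_size=BIN_SIZE):
    """Count read pairs per (bin_i, bin_j) with bin_i <= bin_j.

    `pairs` is an iterable of (pos1, pos2) positions on one chromosome.
    """
    counts = Counter()
    for pos1, pos2 in pairs:
        i, j = sorted((pos1 // bin_size, pos2 // bin_size))
        counts[(i, j)] += 1
    return counts


def to_matrix(counts, n_bins):
    """Expand the sparse counts into a symmetric n_bins x n_bins matrix."""
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for (i, j), count in counts.items():
        matrix[i][j] = count
        matrix[j][i] = count
    return matrix


# Toy read pairs (positions in bp), not real data.
toy_pairs = [(5_000, 90_000), (120_000, 180_000), (50_000, 150_000), (250_000, 20_000)]
matrix = to_matrix(bin_contacts(toy_pairs), n_bins=3)
for row in matrix:
    print(row)
```

In a well-ordered assembly the counts concentrate near the diagonal of each chromosome block, which is the pattern a Hi-C heat map is inspected for.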
In total, 277.88 million read pairs (77.53 Gb clean data; 94.06× coverage of the genome) were generated from the Hi-C library (Table 1), of which 77.26% were uniquely mapped to the assembled genome. Of the uniquely mapped read pairs, 60.67% (130.26 million) were valid interaction pairs, which were used for the subsequent Hi-C assembly (Table S1). A total of 844.00 Mb (99.9%) of the assembled genome sequence was anchored to 23 chromosomes, and the order and orientation of 827.39 Mb (98.03% of the anchored sequence) could be determined. The detailed distribution of each chromosome sequence is shown in Table 2. The heat map of the Hi-C interaction bins is consistent with a genome assembly of excellent quality (Fig. 3).

Gene prediction and annotation. Gene structure prediction was based on three different strategies: ab initio-based, homology-based, and unigene-based. Genscan 46 54 were used for assembly based on reference transcripts, and TransDecoder v2.0 and GeneMarkS-T v5.1 55 were used for gene prediction. PASA v2.0.2 56 was used to predict unigene sequences based on the reference-free assembly of the full-length transcriptome data. Finally, EVM v1.1.1 57 was used to integrate the prediction results obtained by the above three methods, and PASA v2.0.2 was used to modify the final gene models. A total of 26,370 protein-coding genes were predicted by integrating the ab initio, homology-based and RNA-seq predictions (Table S2), with an average gene length of 14,483 bp, exon length of 2,601 bp, coding sequence length of 1,724 bp and intron length of 11,882 bp (Table 4). Finally, 25,760 genes (97.69% of the total) were successfully annotated against the GO, KEGG, KOG, TrEMBL, and NR databases (Table S3).
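The percentages reported for the Hi-C anchoring and the gene annotation follow directly from the quoted totals, and can be re-derived as a quick consistency check. The sketch below uses only numbers given in the text; the variable and function names are illustrative, and small differences in the last digit are expected because the inputs are already rounded.

```python
def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Hi-C anchoring (sizes in Mb).
anchored_pct = percent(844.00, 844.88)   # anchored / assembled
oriented_pct = percent(827.39, 844.00)   # ordered and oriented / anchored

# Hi-C read pairs (millions), derived from the reported percentages.
unique_pairs = 277.88 * 0.7726
valid_pairs = unique_pairs * 0.6067

# Gene annotation and average gene structure (bp).
annotated_pct = percent(25_760, 26_370)
gene_length = 2_601 + 11_882             # average exon + intron length

print(f"anchored: {anchored_pct:.2f}%  oriented: {oriented_pct:.2f}%")
print(f"valid pairs: {valid_pairs:.2f} M  annotated: {annotated_pct:.2f}%")
print(f"exon + intron length: {gene_length} bp")
```

The check shows that the 98.03% figure is relative to the 844.00 Mb of anchored sequence rather than to the full 844.88 Mb assembly, and that the average exon and intron lengths sum to the reported average gene length of 14,483 bp.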
Blastn searches of the M. salmoides genome against the Rfam database 58 were used to identify microRNAs and rRNAs, and tRNAscan-SE 59 was used to identify tRNAs. A total of 2,693 non-coding RNAs were predicted, including 633 microRNAs (miRNAs) from 84 families, 230 rRNA genes from 4 families and 1,830 tRNA genes from 25 families (Table S4). Pseudogenes were predicted as follows. The predicted protein sequences were used to search for homologous gene sequences (putative genes) through BLAT alignment 60. GeneWise 61 was then used to search for premature stop codons and frameshift mutations in these gene sequences to identify pseudogenes. In total, 986 pseudogenes were identified, with a total length of 5,885,501 bp and an average length of 5,969 bp (Table S4).

Table 4. The basic information statistics of assembled genome.